Add GLM 5 MTP #1513
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include:

- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
…or Draft Model).
|
I'm really only using |
|
Are there any standardized tests to check the scores of the LLM regarding tool-call performance etc. that can be run locally? |
Not really, but you will very quickly find out if it starts hallucinating the tool calls in the chat ( |
|
Okay, what about using smol-IQ1_KT fully GPU-offloaded as a draft for a larger quant with only the head and the KV-cache offloaded? I'm getting about 31 tps decode at zero ctx and 21 tps at 32k ctx. [EDIT]: naaah. I don't think it's worth it. 21 tps at 32k ctx is already slow enough. Hmm... I should probably finally try with double EPYC. |
@ikawrakow To be honest, I already had the GLM5 and use it fairly often, so I wanted to add it to have a point of comparison. As for other MTPs, I don’t plan on adding them for now, especially since we don’t retain the layer and it’s unlikely anyone would want to re-quantize just to test a slow feature.
@jukofyork With MLA 1 or 3 I saw slightly lower performance. For me the ranking was: no MLA > MLA3 > MLA1. To be honest, I haven’t been fine-tuning the arguments for a while, but since you mentioned -draft-min, I have an idea in mind that might help better define that parameter. I’ll see how it works in practice later.
@magikRUKKOLA Could you give me some details about the arguments used? I tested it with Kimi K2.5, thinking there was an incompatibility with MTP, then I tested it with GLM5 without MTP and didn't get any errors. |
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
--model /opt/ubergarm/GLM-5-GGUF/smol-IQ2_KS/GLM-5-smol-IQ2_KS-00001-of-00006.gguf \
--alias ubergarm/GLM-5-smol-IQ2_KS \
--ctx-size $((128 * 1024)) \
-b $((1024)) -ub $((1024)) \
--mlock \
--temp 0.0 --top-p 1.0 --top-k 0 \
-ctk q6_0 \
-ctv q6_0 \
-mtp \
-khad \
-ger \
-smgs \
-sas \
-muge \
-mea 16 \
-amb 16 \
--merge-qkv \
--graph-reduce-type bf16 \
--split-mode layer \
--main-gpu 0 \
--max-gpu 0 \
--n-gpu-layers 99 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--special \
--verbosity 1 \
--verbose-prompt \
--reasoning-format auto \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
[EDIT]: woops. I had to use |
On mainline |
|
[EDITED]:
|
What arguments should I use, once again? How do I set the draft size? [EDIT]: Oh. I see. So via the |
Its with |
|
@magikRUKKOLA I wasn't able to reproduce the same error with your arguments; the only difference was that I couldn't fully offload such a large model to the GPU. That said, some errors did occur, and they were fixed by the most recent rebase of the branch. Since your first test was done before that, please try pulling again. For more context, the models that have MTP and support it are GLM 4.5/4.6/4.7 and 5.0. You can pass the -mtp flag with any other model, and it will be disabled (I used Kimi K2.5 to check whether this logic was causing your crash). Currently, MTP only supports --draft-max and --draft-p-min.
@jukofyork I believe that certain parameters, such as draft-max, draft-min, and p-min, could be optimized, perhaps using a controller that can adjust the parameters based on the hit rate of the speculative models. Since you’re running some tests, are there any parameters you’d like me to test? |
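To make the controller idea concrete, here is a minimal sketch in Python rather than the project's C++. It is not code from this PR: the class name, the thresholds, and the smoothing factor are all assumptions chosen for illustration. It tracks an exponential moving average of the acceptance rate and nudges the draft size up or down accordingly.

```python
from dataclasses import dataclass


@dataclass
class DraftController:
    """Hypothetical controller: adapts draft_max from an EMA of the acceptance rate."""

    draft_max: int = 1
    min_draft: int = 1
    max_draft: int = 4
    alpha: float = 0.2         # EMA smoothing factor (assumed value)
    raise_above: float = 0.8   # grow the draft when acceptance is high (assumed)
    lower_below: float = 0.5   # shrink the draft when acceptance is low (assumed)
    ema: float = 0.0
    seen: bool = False

    def update(self, drafted: int, accepted: int) -> int:
        """Record one speculative step and return the draft size for the next one."""
        if drafted <= 0:
            return self.draft_max  # nothing was drafted, nothing to learn
        rate = accepted / drafted
        if not self.seen:
            self.ema, self.seen = rate, True
        else:
            self.ema = self.alpha * rate + (1 - self.alpha) * self.ema
        if self.ema > self.raise_above:
            self.draft_max = min(self.max_draft, self.draft_max + 1)
        elif self.ema < self.lower_below:
            self.draft_max = max(self.min_draft, self.draft_max - 1)
        return self.draft_max


controller = DraftController()
for drafted, accepted in [(1, 1), (2, 2), (3, 1), (3, 0)]:
    print(controller.update(drafted, accepted), round(controller.ema, 3))
```

A real controller would probably also weigh the measured cost of a draft pass against a plain decode step, since a high hit rate alone does not guarantee a speedup.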
|
Aha! Yes, it does not crash indeed. Its like without Overall, with |
Don't worry, one day it will be optimized enough to be worth it (I hope). |
|
Should I re-try with hybrid inference? |
See the posts in this thread, starting here: ggml-org/llama.cpp#10466 (comment) I tried to simplify it to the bare minimum here: but nobody seemed interested and mainline The key thing from all my experiments is that you can't really just use a fixed
Some kind of adaptive controller would be the next step, but there was pretty much zero interest in that discussion and PR... I'm also not convinced the current logic is correct: ggml-org/llama.cpp#10466 (comment) The code has got so many tricky optimisations in it now though, but I think you can show that if If you look at the costs for my |
@magikRUKKOLA If you want to test whether the GLM5 MTP code works, go ahead, I'd appreciate it, but in terms of performance it shouldn't make much of a difference.
@jukofyork This is great material. I need more time to read through the details, but I’ll definitely use it when I start working on this feature. I believe the parameters can be inferred in real time, which allows the settings to adapt to the user’s needs and use cases. At the end of the session, a snapshot of the current metrics could be provided so that the user can use it as a default in the future if they wish. |
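As a rough illustration of the snapshot idea, the sketch below serializes a set of session metrics and turns them back into the flags mentioned in this thread (`-mtp`, `--draft-max`, `--draft-p-min`). The snapshot keys and the helper name are hypothetical, and nothing like this exists in the PR.

```python
import json


def snapshot_to_args(snapshot: dict) -> list[str]:
    """Turn a hypothetical end-of-session snapshot into CLI flags."""
    return [
        "-mtp",
        "--draft-max", str(snapshot["draft_max"]),
        "--draft-p-min", f'{snapshot["draft_p_min"]:.2f}',
    ]


# Assumed snapshot layout; the values are placeholders for illustration.
snapshot = {"draft_max": 2, "draft_p_min": 0.85, "accept_rate": 0.595}

saved = json.dumps(snapshot)                 # what would be written to disk
args = snapshot_to_args(json.loads(saved))   # what the next session would reuse
print(" ".join(args))
```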
|
GLM5 IQ2_KL without with |
The performance loss is consistent with my tests, which leads me to believe that the initial gains will be in hybrid/CPU-only inference, but that in the future the main gains will come from the GPU. |
|
I did a rebase and the gap has narrowed, but it’s still there: Benchmark (
|
| Mode | Code | Extract | Story | Overall | Accept rate |
|---|---|---|---|---|---|
| Baseline | 11.3 t/s (1009 tok) | 11.4 t/s | 11.3 t/s | 11.3 ± 0.0 t/s | N/A |
| MTP draft-max 1 | 9.7 t/s (832 tok) | 9.9 t/s | 8.9 t/s | 9.5 ± 0.5 t/s | 85.1% ± 12.2% |
| MTP draft-max 2 | 8.8 t/s (832 tok) | 10.0 t/s | 8.1 t/s | 9.0 ± 0.9 t/s | 59.5% ± 12.8% |
The gap remains between 17% and 22%. The drop in the acceptance rate to 60% catches my attention: there may be some issue with the MTP embeddings, and it may be worth testing with draft 3.
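For anyone who wants to redo the arithmetic, this small sketch computes the relative slowdown from the Overall column of the table above. The helper name is made up for illustration. On that column alone the figures come out slightly lower than the quoted range, which may have been computed from different columns or runs.

```python
def slowdown_pct(base: float, value: float) -> float:
    """Relative throughput loss versus the baseline, in percent."""
    return (1.0 - value / base) * 100.0


baseline = 11.3  # overall t/s without MTP, taken from the table
runs = {"draft-max 1": 9.5, "draft-max 2": 9.0}  # overall t/s with MTP

for name, tps in runs.items():
    print(f"{name}: {slowdown_pct(baseline, tps):.1f}% slower than baseline")
```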
cur = llm_build_norm(ctx0, cur, hparams, mtp_layer.attn_norm, NULL, LLM_NORM_RMS, cb, il);
cb(cur, "attn_norm", il);

{
Apart from the above construction of the MTP input, is this function just a copy of build_deepseek2 for one layer?
Yes, MTP is a typical one-layer decoder after the inputs.
I reviewed it, and the architecture matches SGLang; there was just a small fix missing in the post-layer. I also checked the embeddings just to be sure, and they match. I reran the benchmark on top of a new rebase, and the performance remains consistent with what I highlighted in my last test. I also tested with --draft-max 3, which yielded 8.1 ± 1.2 t/s overall and a 45.8% ± 12.6% accept rate.

Adds MTP support for GLM-5. To try it, pass -mtp to activate it, and use --draft-max and --draft-p-min to control how many tokens you want to generate.
Tests applied
I copied the "Top" YouTube section from Wikipedia: https://en.wikipedia.org/wiki/YouTube
GLM 5 smol-IQ2_KS - Draft size = 10, p-min = 0.85, -ot "blk.78..*=CUDA1", --seed 42
Without MTP vs With MTP